monkey methods research group

futuristic play.

Monkeyin' Around: Is BitTorrent Dead?

Date: January 10, 2005

WARNING: “Monkeyin' Around” contains rambling and wild speculation on the future of digital media. Do not operate heavy machinery while reading. Read the first edition here. Visit our blog at http://blog.monkeymethods.org.

What the heck is this article about?

After the recent shutdowns in the BitTorrent community, notably the popular site SuprNova.org, many were left wondering if BitTorrent was on its last legs. You can read some of the coverage here. Since this happened, many people are asking: How big of a blow are these shutdowns? Is BitTorrent dead or dying?

Well, we had the same questions too, and decided we wanted to understand the distribution of torrent files on the Internet. Using this information, we can examine issues such as centralization and other important factors.

(If you want an introduction to BitTorrent, please read this Wired article and this FAQ)

Okay Sherlock, what did you guys do?

Well, first thing, we have some pretty interesting data lying around. One of the initial projects we decided to do as part of Monkey Methods was TowerSeek.org, which is a true crawler-based BitTorrent search engine. Unlike other sites that simply mirror either Google's torrent search functions (try “filetype:torrent induce” for example), SuprNova, or some other site, we wanted to build a real search engine that crawled the Internet automatically. We'll write more about this project soon, but you can give it a whirl right now.

As part of the backend, TowerSeek.org has a database of links to torrent files, which we realized could be used to understand the distribution of torrent files across sites on the Internet. This would tell us a couple of important things:

•  How centralized are torrent files on the Internet?

•  Do torrent sites follow the 20/80 rule?

•  How long is the Long Tail?

These questions are all important because they concern vital (and interesting) differences between BitTorrent and other P2P protocols. Unlike Kazaa, Gnutella, and many others, BitTorrent has a fundamentally “web-based” interface. That means you go to a website in your browser (preferably Firefox), click on a link from that trusted site, and download. So you would expect these sites to vaguely follow the same distributions as websites on the Internet.

Also, through the same mechanisms, the architecture of BitTorrent is far more centralized than other P2P networks. For each file, there is a central “tracker” that keeps track of which clients are sharing that file, so clients can find each other, swap the pieces they are missing, and download efficiently. Kill the tracker, and you kill the ability of any client to trade files with each other. It is for these reasons that BitTorrent is almost more similar to a direct-connect protocol like FTP or HTTP than a P2P network like Kazaa.

All of these architectural differences make it interesting to look at the data. To answer the questions from above, we did some UNIX pipe-fu to dump out the pages from the database, aggregate them by domain, sort them, and put them in an Excel-friendly format, all in one step. Five minutes later, we were analyzing away.
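We did this with shell pipes, but the same aggregate-and-sort step can be sketched in Python. This is only an illustration under assumptions: the function names, the CSV column names, and the sample links below are all made up for the example and are not the actual TowerSeek.org schema or data.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse


def torrents_per_domain(urls):
    """Count how many torrent links were found at each domain."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host:  # skip malformed links that have no host part
            counts[host] += 1
    # most_common() returns (domain, count) pairs, largest count first
    return counts.most_common()


def to_csv(rows):
    """Render (domain, count) rows as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["domain", "torrents"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample links, not real data from the crawl
sample_urls = [
    "http://big.example/a.torrent",
    "http://big.example/b.torrent",
    "http://big.example/c.torrent",
    "http://small.example/x.torrent",
]
ranked = torrents_per_domain(sample_urls)
print(to_csv(ranked))
```

The output is one row per domain, biggest first, which is the shape every chart in the rest of this article is built from.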

What did you find?

We found a lot of interesting things. First of all, it should be noted that the dataset was from early December, and thus preserves the distribution of torrents before the recent site shutdowns. It may be interesting to look at this data again in a couple months and see how it has changed over time.

The first thing we did was to simply take the mean, median and mode of the number of torrents per domain:

Mean     176
Median   3
Mode     1

Wow. That's a very skewed distribution. It's clearly biased towards a small number of sites with many torrents, followed by a long, long tail. In fact, the most common count is just 1 torrent per domain. Let's take a look at the graph:
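To see how a handful of big sites can drag the mean far above the median, here is a small sketch using Python's standard `statistics` module. The counts are invented for illustration and are not our crawl data.

```python
from statistics import mean, median, mode

# Hypothetical torrents-per-domain counts: one huge site, many tiny ones
counts = [1000, 40, 5, 3, 3, 2, 1, 1, 1, 1, 1]

summary = {
    "mean": mean(counts),      # pulled upward by the single large site
    "median": median(counts),  # the middle domain when sorted by count
    "mode": mode(counts),      # the most frequent count
}
print(summary)
```

Even in this toy list the mean lands far above the median, while the mode sits at 1, the same pattern as in the table above.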

Figure 1:

Ah ha! We can see that this is the classic Zipf's Law distribution, or at least it looks like one at first glance. How close is it? Well, if we take the logarithm of both axes, we should see a straight line, as explained in this article. So let's do that:

Figure 2:


Okay, that's pretty close to a straight line, although there's a slight skew. We did a linear regression and you can see the skew versus a linear fit:
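For the curious, the idea behind the log-log check is that a Zipf-style distribution has the form count = C / rank^s, so log(count) = log(C) - s * log(rank), which is a straight line with slope -s. Below is a minimal sketch of such a fit using ordinary least squares written out by hand. The function name and the synthetic data are assumptions for illustration; this is not the exact spreadsheet regression we ran.

```python
import math


def loglog_fit(counts):
    """Least-squares line through (log10 rank, log10 count).

    Returns (slope, intercept). All counts must be positive, and at
    least two are needed for the fit to be defined.
    """
    ranked = sorted(counts, reverse=True)  # rank 1 is the biggest site
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Synthetic data that follows count = 1000 / rank exactly (not the crawl data)
synthetic = [1000 / rank for rank in range(1, 51)]
slope, intercept = loglog_fit(synthetic)
print(round(slope, 3), round(intercept, 3))
```

On perfectly Zipf-shaped input the slope comes out at essentially -1. Real data will wander off the line, and where it wanders is the interesting part.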

Figure 3:

Compared to the linear fit, you can see that the high-end sites have significantly fewer torrents than predicted. There are several possible reasons. One may be that the infrastructure needed to maintain that many actively traded files imposes an upper ceiling. Another reason is more interesting: It may be that the high-end sites are actually normal, and that the long tail is extra, extra long. In this case, a super long tail makes the high end look small in comparison. Many other avenues of speculation exist.

In fact, regarding the long tail theory, if you build a histogram and put the number of torrents per domain in buckets, you can see that a huge number of domains have a grand total of 1 hosted torrent:

Figure 4:
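The bucketing behind a histogram like this can be sketched as follows. The bucket edges (1, 2-10, 11-100, 101+) are assumptions picked for the example rather than the exact bins in our chart, and the counts are made up.

```python
from collections import Counter


def bucket_label(count):
    """Map a torrents-per-domain count to a coarse histogram bucket."""
    if count == 1:
        return "1"
    if count <= 10:
        return "2-10"
    if count <= 100:
        return "11-100"
    return "101+"


def histogram(counts):
    """Number of domains that fall into each bucket."""
    return Counter(bucket_label(c) for c in counts)


# Hypothetical torrents-per-domain counts, not the crawl data
counts = [1000, 40, 5, 3, 3, 2, 1, 1, 1, 1, 1]
hist = histogram(counts)
for label in ["1", "2-10", "11-100", "101+"]:
    print(label, hist[label])
```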

So clearly, the long tail is quite long, but at the same time, the larger sites also contain a huge percentage of the torrents. In fact, if you look at the specific data, you can see that the centralization is quite bad. Instead of the top 20% having 80% of the files, you actually see 4% of the sites having 80% of the files.
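A figure like "X% of the sites hold 80% of the files" can be computed by walking down the ranked list until the running total crosses the target. Here is a hedged sketch of that calculation; the function name and the sample counts are hypothetical, and the result below describes only the toy data, not our dataset.

```python
def share_of_sites_holding(counts, target=0.80):
    """Smallest fraction of sites, taken largest first, that together
    hold at least `target` of all torrents."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    running = 0
    for sites_used, count in enumerate(ranked, start=1):
        running += count
        if running >= target * total:
            return sites_used / len(ranked)
    return 1.0  # only reached for an empty list


# Hypothetical torrents-per-domain counts, not the crawl data
counts = [1000, 40, 5, 3, 3, 2, 1, 1, 1, 1, 1]
fraction = share_of_sites_holding(counts)
print(f"{fraction:.1%} of sites hold 80% of the torrents")
```

In this toy list a single site out of eleven already clears the 80% mark, which is the same kind of lopsidedness we saw in the real numbers.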

After reviewing all the data, here are a couple of our key conclusions, which are further discussed below:

  1. Torrent files are extremely centralized, and do follow a Zipf's Law-like distribution.
  2. However, instead of 20/80, the distribution is more like 4/80.
  3. The long tail is very, very long: close to 87% of sites host fewer than 100 torrents.

These are all interesting findings. Now let's try to explain some of this…

Is this where the wild speculation comes in?

Yes. And rambling too, of course.

Centralization can be explained by network effects

The first way to explain why torrents are extremely centralized is that, fundamentally, the sites act as marketplaces. Just as eBay is able to attract buyers, and thus more sellers, and thus more buyers, and so on, torrent sites use the same magic. Ultimately, these sites are simply groups of leechers and seeders getting together to “transact.”

Leechers go because they know they can find seeders, and get the files they want. Seeders go because they know they can find people interested in sharing their wealth. Add in a couple layers of community (forums, reputation, etc.), and you also create a trusted place to transact. This all acts as a network effect to mutually reinforce both groups of people going to the same site.

This is the main reason that a small number of sites accounted for a huge percentage of the torrents. 10% of the domains having over 90% of the files is a big deal, and shows a distribution heavily skewed towards centralized locations.

However, audiences are fragmented and diverse

That said, we found that the long tail was really quite long. 87% of the sites had only a small number of torrents (under 100) hosted at that domain. In fact, there were close to 1000 sites with 10 or fewer torrents.

This tells you a couple of things: First, it's obviously easy to host a torrent file. Simply uploading a torrent file into a directory at some ISP is no big deal. There are many people out there hosting torrents, 1, 2, or 3 at a time. Second, it means that there is probably a very diverse audience out there, who all want different kinds of files.

That means there is fragmentation that exists even outside the “supermarkets” of the largest torrent sites. The reason is that there are many niches of content out there, which fragments the addressable audience for any given file. So just as cable TV eventually created more choice, the sheer number of torrent sites indicates the same phenomenon occurring here.

In the eBay analogy, just because the site trades everything doesn't mean that small sites like The Pit can't create a fluid market in specialty items.

What are the caveats on this research?

Well, there are lots, actually. Large BitTorrent communities sometimes require usernames and passwords, and the crawler simply can't access those files. Also, although the crawler has been in development and been active over many months, it hasn't had the opportunity to look through all the nooks and crannies on the web. So although it's a lot of interesting data, it's also true that we could be missing something.

In fact, as a quick plug: If you are someone who hosts torrent files or know someone who does, here are 3 simple ways to make BitTorrent search engines work more effectively. This is an important thing to standardize and solve, if the community wants to drive higher adoption.

So seriously, is BitTorrent dead?

No. Well, we don't think so, at least.

Although we were certainly interested in the fact that torrents were so heavily centralized, the big sites grew this way organically, and did not require massive advertising budgets to grow their audience. If eBay were shut down, another marketplace would simply take its place, by growing into the position. As large torrent sites are shut down, we predict that smaller sites will simply grow to take their place. The only cost of higher traffic is bandwidth, and luckily, online advertising revenues also scale with traffic. This creates an incentive for new sites to emerge and grow.

In this case, centralization is a feature, not a necessity. Just look at del.icio.us most popular and you'll see BitTorrent sites every couple days, as people uncover new places to find the files they're looking for.

But more importantly, the biggest thing we learned from this exercise was that many diverse groups of people are embracing BitTorrent, and the number of sites hosting torrent files is growing by the day. This fragmentation makes tracking down central sites difficult, if not impossible, and also shows how easy it is to host a front-end to torrents. Projects like BlogTorrent will only drive this trend more and more mainstream. And hopefully search engines like TowerSeek.org will help unite these disparate sources of information, and make things easy to find, regardless of where the files are.

The people who aim to stop P2P have turned a centralized system like Napster (easily controlled, easily monitored) into a fully decentralized system in the form of Kazaa, as well as a fragmented ecosystem of thousands of centralized servers through BitTorrent. This was probably a bad decision. As the folks on Fark.com say, “hilarity ensues.”

If you have comments, contact us at monkeys [at] monkeymethods [dot] org or comment on our blog here.
